The Demand for a Sound Baseline in GPU Memory Architecture Research
Authors
Abstract
Modern GPUs adopt massive multithreading and multi-level cache hierarchies to hide long operation latencies, especially off-chip memory access latencies. However, poor cache indexing and cache-line allocation policies, as well as a small number of miss-status handling registers (MSHRs), can exacerbate cache thrashing and cache-miss-related resource congestion. In addition, modulo address mapping among memory partitions may cause severe partition camping, resulting in underutilization of DRAM bandwidth and of the banked L2 cache capacity. Furthermore, prior GPU cache bypassing studies unrealistically assume there is no limit on the number of in-flight bypassed requests, which may lead to pathological experimental results in simulation. In this work, we investigate the performance impact of the aforementioned factors and demonstrate the necessity of a sound baseline in GPU memory architecture research. Our results show that advanced cache indexing functions can greatly reduce conflict misses and improve cache efficiency, and that the allocation-on-fill policy yields better performance than allocation-on-miss. Moreover, performance does not consistently improve with more MSHRs; in certain scenarios it can even degrade. XOR mapping can greatly mitigate memory partition camping. Finally, the limited number of in-flight bypassed requests that hardware can support should be taken into account in GPU cache bypassing studies, for more reliable results and conclusions.
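The contrast between plain modulo indexing and a XOR-based indexing function can be sketched in a few lines. The parameters below (set count, line size, stride) are illustrative assumptions for a toy cache model, not the concrete indexing function evaluated in the paper: the point is only that a power-of-two access stride collapses onto a single set under modulo indexing, while a XOR hash that folds in higher address bits spreads the same accesses across many sets.

```python
# Hypothetical toy cache geometry, chosen only for illustration.
NUM_SETS = 64      # sets in the cache
LINE_BYTES = 128   # cache line size in bytes

def modulo_set_index(addr: int) -> int:
    """Conventional modulo indexing: the low bits of the line address
    select the set directly."""
    return (addr // LINE_BYTES) % NUM_SETS

def xor_set_index(addr: int) -> int:
    """XOR-based indexing: fold higher line-address bits into the set
    index so that power-of-two strides no longer alias to one set."""
    line = addr // LINE_BYTES
    return (line ^ (line // NUM_SETS)) % NUM_SETS

# A stride equal to one full "way" of the cache (NUM_SETS * LINE_BYTES)
# is the classic pathological pattern for modulo indexing.
stride = NUM_SETS * LINE_BYTES          # 8 KiB between accesses
addrs = [i * stride for i in range(16)]

mod_sets = {modulo_set_index(a) for a in addrs}
xor_sets = {xor_set_index(a) for a in addrs}

print(len(mod_sets))  # modulo: every access hits the same set
print(len(xor_sets))  # XOR: accesses spread across 16 distinct sets
```

The same hashing idea applies one level down: replacing the modulo mapping of line addresses to memory partitions with a XOR of higher address bits is what mitigates the partition camping the abstract describes.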
Similar Resources
Embedded Memory Test Strategies and Repair
The demand for self-testing increases proportionally with memory size in System on Chip (SoC) designs. Memories normally occupy the majority of a SoC's area, and as the density of embedded memories grows, a self-testing mechanism becomes a necessity in SoC design. This research study therefore focuses on this problem and introduces a streamlined solution for self-testing. In the proposed m...
An Approach to Improve the Particle Swarm Optimization Algorithm Using CUDA
The time consumed in solving computationally heavy problems has always been a concern for computer programmers. Due to the simplicity of its implementation, PSO (Particle Swarm Optimization) is a suitable meta-heuristic algorithm for such problems. However, despite this simplicity, the algorithm is inefficient for real computationally heavy problems, but the pr...
Ultra-Low-Energy DSP Processor Design for Many-Core Parallel Applications
Background and Objectives: Digital signal processors are widely used in energy-constrained applications in which battery lifetime is a critical concern; accordingly, designing ultra-low-energy processors is a major goal. In this work, as a first step, we propose a sub-threshold DSP processor. Methods: As our baseline architecture, we use a modified version of an existing ultra-low-power...
The Effectiveness of Training with a Reading Assistant Package on Dyslexic Children's Working Memory: A Multiple-Baseline Single-Case Study
Introduction: In the literature, cognitive problems such as poor working memory are considered among the main reading problems of dyslexic children. Therefore, the present study was conducted to determine the effect of Reading Assistant Package training on the working memory of dyslexic children. Methods: The present study used a single-subject, multiple-baseline research design...
متن کاملNVIDA CUDA Architecture-Based Parallel SAT Solver
The SAT problem was the first problem shown to be NP-complete, and no known algorithm solves it in polynomial time. Over the past decade, the development of efficient and scalable algorithms has dramatically improved the ability to solve SAT instances involving tens of thousands of variables and millions of constraints. But as industry demand increases, a faster SAT solver is ne...